Whittle index policy
Lagrangian Relaxation for Multi-Action Partially Observable Restless Bandits: Heuristic Policies and Indexability
Partially observable restless multi-armed bandits have found numerous applications, including in recommendation systems, communication systems, public-healthcare outreach systems, and operations research. We study multi-action partially observable restless multi-armed bandits, a generalization of the classical restless multi-armed bandit problem in which 1) each bandit has finitely many states, and the current state is not observable, and 2) each bandit has finitely many actions; in particular, we assume that more than two actions are available for each bandit. We motivate our problem with the application of public-health intervention planning. We describe the model and formulate a long-term discounted optimization problem in which the state of each bandit evolves according to a Markov process, and this evolution is action dependent. The state of a bandit is not observable, but one of finitely many feedback signals is observable. Each bandit yields a reward based on the action taken on that bandit. The agent is assumed to have a budget constraint. The bandits are assumed to be independent; however, they are weakly coupled at the agent through the budget constraint. We first analyze the Lagrangian bound method for our partially observable restless bandits. The computation of optimal value functions for finite-state, finite-action POMDPs is non-trivial; hence, the computation of Lagrangian bounds is also challenging. We describe approximations for the computation of Lagrangian bounds using point-based value iteration (PBVI) and an online rollout policy. We further present various properties of the value functions and provide theoretical insights on PBVI and the online rollout policy. We study heuristic policies for multi-action PORMABs. Finally, we present Whittle index policies and discuss their limitations in our model.
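The Lagrangian bound mentioned above comes from relaxing the shared budget constraint into the objective, which decouples the weakly coupled arms into independent single-arm problems. Below is a minimal sketch of that decomposition, assuming fully observable arms (the paper's arms are POMDPs, where PBVI or rollout approximations stand in for exact value iteration); all names and array shapes are our own assumptions, not the paper's code.

```python
import numpy as np

def arm_value(P, r, c, lam, gamma=0.95, iters=500):
    """Value iteration for one decoupled arm with relaxed reward r[s, a] - lam * c[a].

    P: (A, S, S) transition kernels, r: (S, A) rewards, c: (A,) action costs.
    """
    V = np.zeros(r.shape[0])
    for _ in range(iters):
        # Q[s, a] = r[s, a] - lam * c[a] + gamma * sum_{s'} P[a, s, s'] * V[s']
        Q = r - lam * c[None, :] + gamma * np.einsum("asp,p->as", P, V).T
        V = Q.max(axis=1)
    return V

def lagrangian_bound(arms, init_states, lam, budget, gamma=0.95):
    """Upper bound on the coupled problem for any multiplier lam >= 0."""
    decoupled = sum(arm_value(P, r, c, lam, gamma)[s]
                    for (P, r, c), s in zip(arms, init_states))
    return decoupled + lam * budget / (1.0 - gamma)
```

Minimizing this bound over lam >= 0 tightens it; with belief states in place of observable states, each arm_value call becomes the POMDP computation the abstract approximates with PBVI or rollout.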
Online Learning of Whittle Indices for Restless Bandits with Non-Stationary Transition Kernels
Shisher, Md Kamran Chowdhury, Tripathi, Vishrant, Chiang, Mung, Brinton, Christopher G.
We consider optimal resource allocation for restless multi-armed bandits (RMABs) in unknown, non-stationary settings. RMABs are PSPACE-hard to solve optimally, even when all parameters are known. The Whittle index policy is known to achieve asymptotic optimality for a large class of such problems, while remaining computationally efficient. In many practical settings, however, the transition kernels required to compute the Whittle index are unknown and non-stationary. In this work, we propose an online learning algorithm for Whittle indices in this setting. Our algorithm first predicts current transition kernels by solving a linear optimization problem based on upper confidence bounds and empirical transition probabilities calculated from data over a sliding window. Then, it computes the Whittle index associated with the predicted transition kernels. We design these sliding windows and upper confidence bounds to guarantee sub-linear dynamic regret in the number of episodes $T$, under the condition that transition kernels change slowly over time (at a rate upper bounded by $\epsilon = 1/T^k$ with $k>0$). Furthermore, our proposed algorithm and regret analysis are designed to exploit prior domain knowledge and structural information of the RMABs to accelerate the learning process. Numerical results validate that our algorithm achieves the lowest cumulative regret relative to baselines in non-stationary environments.
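The core mechanism described here, empirical transition estimates over a sliding window plus upper confidence bounds, can be sketched briefly. A hedged illustration follows, with the window handling, the Weissman-style radius, and all names chosen by us rather than taken from the paper.

```python
import numpy as np
from collections import deque

def windowed_kernel_estimate(transitions, S, A, window, delta=0.05):
    """Empirical transition kernels plus L1 confidence radii over a sliding window.

    transitions: iterable of (s, a, s_next) samples for one arm, oldest first.
    The radius follows a common Weissman-style bound; the paper's exact
    constants and window design differ.
    """
    counts = np.zeros((S, A, S))
    for s, a, s_next in deque(transitions, maxlen=window):
        counts[s, a, s_next] += 1
    n = counts.sum(axis=2)                       # visits to each (s, a) pair
    P_hat = counts / np.maximum(n[:, :, None], 1)
    P_hat[n == 0] = 1.0 / S                      # unvisited pairs: uniform guess
    radius = np.sqrt(2 * S * np.log(2 / delta) / np.maximum(n, 1))
    return P_hat, np.minimum(radius, 2.0)        # L1 distance never exceeds 2
```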
Multi-Action Restless Bandits with Weakly Coupled Constraints: Simultaneous Learning and Control
Fu, Jing, Moran, Bill, Niño-Mora, José
We study a system with finitely many groups of multi-action bandit processes, each of which is a Markov decision process (MDP) with finite state and action spaces and potentially different transition matrices when taking different actions. The bandit processes of the same group share the same state and action spaces and, given the same action that is taken, the same transition matrix. All the bandit processes across various groups are subject to multiple weakly coupled constraints over their state and action variables. Unlike past studies that focused on the offline case, we consider the online case without assuming full knowledge of transition matrices and reward functions a priori, and propose an effective scheme that enables simultaneous learning and control. We prove the convergence of the relevant processes in both the timeline and the number of bandit processes, referred to as convergence in the time and magnitude dimensions. Moreover, we prove that the relevant processes converge exponentially fast in the magnitude dimension, leading to an exponentially diminishing performance deviation between the proposed online algorithms and offline optimality.
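To make the weakly coupled constraints concrete, here is a simple static version of an index-driven activation rule: arms are ranked by a priority index and activated greedily while every coupled constraint remains feasible. This is our own simplified sketch, not the paper's scheme; `indices`, `W`, and `b` are hypothetical inputs.

```python
import numpy as np

def greedy_feasible_activation(indices, W, b):
    """Greedy priority rule under weakly coupled constraints (hedged sketch).

    indices: (n,) priority index per arm; W: (K, n) resource use of
    activating each arm; b: (K,) budgets for the K coupled constraints.
    """
    used = np.zeros_like(b, dtype=float)
    active = np.zeros(len(indices), dtype=bool)
    for i in np.argsort(indices)[::-1]:      # best index first
        if np.all(used + W[:, i] <= b):      # activate only while feasible
            used += W[:, i]
            active[i] = True
    return active
```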
Planning and Learning in Risk-Aware Restless Multi-Arm Bandit Problem
Akbarzadeh, Nima, Delage, Erick, Adulyasak, Yossiri
In restless multi-arm bandits, a central agent is tasked with optimally distributing limited resources across several bandits (arms), with each arm being a Markov decision process. In this work, we generalize the traditional restless multi-arm bandit problem with a risk-neutral objective by incorporating risk-awareness. We establish indexability conditions for the case of a risk-aware objective and provide a solution based on the Whittle index. In addition, we address the learning problem when the true transition probabilities are unknown by proposing a Thompson sampling approach, and we show that it achieves a regret bound that scales sublinearly with the number of episodes and quadratically with the number of arms. The efficacy of our method in reducing risk exposure in restless multi-arm bandits is illustrated through a set of numerical experiments.
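The Thompson sampling component can be sketched with Dirichlet posteriors over each transition row: one plausible model is drawn per episode and the index policy is played against it. The sketch below is risk-neutral, whereas the paper's objective is risk-aware, and all names are our assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_kernel(counts, prior=1.0):
    """Draw one plausible transition kernel for an arm from a Dirichlet posterior.

    counts: (S, A, S) observed transition counts; prior: symmetric Dirichlet weight.
    """
    S, A, _ = counts.shape
    P = np.empty_like(counts, dtype=float)
    for s in range(S):
        for a in range(A):
            P[s, a] = rng.dirichlet(counts[s, a] + prior)
    return P
```

Each episode then proceeds as if the sampled kernels were the truth: compute indices under the sampled model, act, and fold the observed transitions back into the counts.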
GINO-Q: Learning an Asymptotically Optimal Index Policy for Restless Multi-armed Bandits
Chen, Gongpu, Liew, Soung Chang, Gunduz, Deniz
The restless multi-armed bandit (RMAB) framework is a popular model with applications across a wide variety of fields. However, its solution is hindered by the exponentially growing state space (with respect to the number of arms) and the combinatorial action space, making traditional reinforcement learning methods infeasible for large-scale instances. In this paper, we propose GINO-Q, a three-timescale stochastic approximation algorithm designed to learn an asymptotically optimal index policy for RMABs. GINO-Q mitigates the curse of dimensionality by decomposing the RMAB into a series of subproblems, each with the same dimension as a single arm, ensuring that complexity increases linearly with the number of arms. Unlike recently developed Whittle-index-based algorithms, GINO-Q does not require RMABs to be indexable, enhancing its flexibility and applicability. Our experimental results demonstrate that GINO-Q consistently learns near-optimal policies, even for non-indexable RMABs where Whittle-index-based algorithms perform poorly, and it converges significantly faster than existing baselines.
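The scaling claim in this abstract, linear rather than exponential growth in the number of arms, is easy to check numerically; the figures below are illustrative, not from the paper.

```python
from math import comb

S, n, m = 10, 20, 3          # states per arm, number of arms, arms activated per step
joint_states = S ** n        # 10**20 joint states in the flat MDP formulation
joint_actions = comb(n, m)   # 1140 combinatorial actions per step
per_arm_entries = n * S * 2  # 400 Q-entries when the RMAB is decomposed arm by arm
print(joint_states, joint_actions, per_arm_entries)
```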
Tabular and Deep Learning for the Whittle Index
Relaño, Francisco Robledo, Borkar, Vivek, Ayesta, Urtzi, Avrachenkov, Konstantin
The Whittle index policy is a heuristic that has shown remarkably good performance (with guaranteed asymptotic optimality) when applied to the class of problems known as Restless Multi-Armed Bandit Problems (RMABPs). In this paper we present QWI and QWINN, two reinforcement learning algorithms, respectively tabular and deep, to learn the Whittle index for the total discounted criterion. The key feature is the use of two time-scales, a faster one to update the state-action $Q$-values, and a relatively slower one to update the Whittle indices. In our main theoretical result we show that QWI, which is a tabular implementation, converges to the real Whittle indices. We then present QWINN, an adaptation of the QWI algorithm using neural networks to compute the $Q$-values on the faster time-scale, which is able to extrapolate information from one state to another and scales naturally to large state-space environments. For QWINN, we show that all local minima of the Bellman error are locally stable equilibria, which is the first result of its kind for DQN-based schemes. Numerical computations show that QWI and QWINN converge faster than the standard Q-learning algorithm, neural-network-based approximate Q-learning, and other state-of-the-art algorithms.
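The two-timescale idea is compact enough to sketch for a single binary-action arm: the fast iterate is ordinary Q-learning under the current passive subsidy, and the slow iterate nudges the subsidy toward the point where both actions are indifferent. This is a hedged simplification; the paper's QWI uses reference-state-indexed Q-functions and particular step-size schedules.

```python
import numpy as np

def qwi_step(Q, lam, s, a, r, s_next, alpha, beta, gamma=0.99):
    """One two-timescale update for a single arm (hedged sketch of the QWI idea).

    Q: (S, 2) state-action values under the current subsidies lam: (S,).
    alpha (fast) updates Q; beta << alpha (slow) updates the index estimates.
    """
    # Fast timescale: Q-learning with subsidy lam[s] paid for the passive action a = 0.
    subsidy = lam[s] if a == 0 else 0.0
    target = r + subsidy + gamma * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])
    # Slow timescale: drive lam[s] toward the point where both actions are indifferent.
    lam[s] += beta * (Q[s, 1] - Q[s, 0])
    return Q, lam
```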
PCL-Indexability and Whittle Index for Restless Bandits with General Observation Models
In this paper, we consider a general observation model for restless multi-armed bandit problems. The operation of the player needs to be based on a feedback mechanism that is error-prone due to resource constraints or environmental or intrinsic noise. By establishing a general probabilistic model for the dynamics of feedback/observation, we formulate the problem as a restless bandit with a countable belief state space starting from an arbitrary initial belief (a priori information). We apply the achievable region method with partial conservation laws (PCL) to the infinite-state problem and analyze its indexability and priority index (Whittle index). Finally, we propose an approximation process to transform the problem into one to which the AG algorithm of Niño-Mora and Bertsimas for finite-state problems can be applied. Numerical experiments show that our algorithm has excellent performance.
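Under a general feedback model like this one, the arm's belief state evolves by a Bayes filter: predict through the action-dependent kernel, then reweight by the likelihood of the observed signal. A minimal sketch follows, with the shapes of `P` and `O` assumed by us.

```python
import numpy as np

def belief_update(b, a, obs, P, O):
    """Bayes filter for one arm: prior belief b, action a, observed signal obs.

    P: (A, S, S) transition kernels; O: (A, S, K) observation model, where
    O[a, s', k] = Pr(signal k | next state s', action a).
    """
    predicted = b @ P[a]                      # predict: b'(s') = sum_s b(s) P[a][s, s']
    unnormalized = predicted * O[a][:, obs]   # correct: weight by signal likelihood
    return unnormalized / unnormalized.sum()  # renormalize over states
```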
On learning Whittle index policy for restless bandits with scalable regret
Akbarzadeh, Nima, Mahajan, Aditya
Reinforcement learning is an attractive approach to learn good resource allocation and scheduling policies based on data when the system model is unknown. However, the cumulative regret of most RL algorithms scales as $\tilde O(\mathsf{S} \sqrt{\mathsf{A} T})$, where $\mathsf{S}$ is the size of the state space, $\mathsf{A}$ is the size of the action space, $T$ is the horizon, and the $\tilde{O}(\cdot)$ notation hides logarithmic terms. Due to the linear dependence on the size of the state space, these regret bounds are prohibitively large for resource allocation and scheduling problems. In this paper, we present a model-based RL algorithm for such problems which has scalable regret. In particular, we consider a restless bandit model, and propose a Thompson-sampling based learning algorithm which is tuned to the underlying structure of the model. We present two characterizations of the regret of the proposed algorithm with respect to the Whittle index policy. First, we show that for a restless bandit with $n$ arms and at most $m$ activations at each time, the regret scales either as $\tilde{O}(mn\sqrt{T})$ or $\tilde{O}(n^2 \sqrt{T})$ depending on the reward model. Second, under an additional technical assumption, we show that the regret scales as $\tilde{O}(n^{1.5} \sqrt{T})$ or $\tilde{O}(\max\{m\sqrt{n}, n\} \sqrt{T})$. We present numerical examples to illustrate the salient features of the algorithm.
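Given a (sampled or estimated) model of a binary-action arm, the Whittle index of a state can be found by bisection on the passive subsidy, assuming indexability so that the active-versus-passive preference is monotone in the subsidy. A hedged sketch under those assumptions:

```python
import numpy as np

def q_values(P, r, lam, gamma=0.95, iters=400):
    """Q-values of a two-action arm with subsidy lam paid for the passive action a = 0.

    P: (2, S, S) transition kernels; r: (S, 2) rewards.
    """
    V = np.zeros(r.shape[0])
    for _ in range(iters):
        Q = np.stack([r[:, 0] + lam + gamma * P[0] @ V,
                      r[:, 1] + gamma * P[1] @ V], axis=1)
        V = Q.max(axis=1)
    return Q

def whittle_index(P, r, s, lo=-100.0, hi=100.0, tol=1e-4):
    """Bisection for the subsidy that makes state s indifferent between actions."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        Q = q_values(P, r, mid)
        lo, hi = (mid, hi) if Q[s, 1] > Q[s, 0] else (lo, mid)
    return 0.5 * (lo + hi)
```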
Optimistic Whittle Index Policy: Online Learning for Restless Bandits
Wang, Kai, Xu, Lily, Taneja, Aparna, Tambe, Milind
Restless multi-armed bandits (RMABs) extend multi-armed bandits to allow for stateful arms, where the state of each arm evolves restlessly with different transitions depending on whether that arm is pulled. Solving RMABs requires information on transition dynamics, which are often unknown upfront. To plan in RMAB settings with unknown transitions, we propose the first online learning algorithm based on the Whittle index policy, using an upper confidence bound (UCB) approach to learn transition dynamics. Specifically, we estimate confidence bounds of the transition probabilities and formulate a bilinear program to compute optimistic Whittle indices using these estimates. Our algorithm, UCWhittle, achieves sublinear $O(H \sqrt{T \log T})$ frequentist regret to solve RMABs with unknown transitions in $T$ episodes with a constant horizon $H$. Empirically, we demonstrate that UCWhittle leverages the structure of RMABs and the Whittle index policy solution to achieve better performance than existing online learning baselines across three domains, including one constructed from a real-world maternal and childcare dataset.
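The episode loop of an optimistic index method has a simple shape: rank arms by optimistic indices, pull the top `budget`, then update the empirical counts. In the skeleton below, `env` and `compute_optimistic_index` are hypothetical stand-ins (the latter for the paper's bilinear program over the confidence set), so this is a sketch of the control flow only.

```python
import numpy as np

def ucwhittle_episode(counts, budget, horizon, env, compute_optimistic_index):
    """One episode skeleton in the spirit of UCWhittle (hedged sketch, not the paper's code).

    counts: list of per-arm (S, A, S) transition counts; env is a hypothetical
    interface with reset() -> states and step(actions) -> (next_states, rewards).
    """
    n_arms = len(counts)
    states = env.reset()
    for _ in range(horizon):
        indices = np.array([compute_optimistic_index(counts[i], states[i])
                            for i in range(n_arms)])
        active = np.argsort(indices)[::-1][:budget]      # pull the top-`budget` arms
        actions = np.isin(np.arange(n_arms), active).astype(int)
        next_states, rewards = env.step(actions)
        for i in range(n_arms):                          # update empirical counts
            counts[i][states[i], actions[i], next_states[i]] += 1
        states = next_states
    return counts
```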